Abstract: This paper proposes a binarization scheme for high-dimensional
vectors based on the recent concept of anti-sparse coding, and shows its
excellent performance for approximate nearest neighbor search. Unlike other
binarization schemes, this framework allows, up to a scaling factor, the
explicit reconstruction of the original vector from its binary representation.
The paper also shows that random projections, which are used in Locality
Sensitive Hashing algorithms, are significantly outperformed by regular frames
on both synthetic and real data when the number of bits exceeds the vector
dimensionality, i.e., when high precision is required.

Key-words: sparse coding, spread representations, approximate nearest neighbor
search, Hamming embedding
Anti-sparse coding for approximate nearest neighbor search
(translated from the French title "Codage anti-parcimonieux pour la recherche
approximative de plus proches voisins")

Résumé (translated from French): This article proposes a binarization
technique that relies on the recent concept of anti-sparse coding, and shows
its excellent performance in the context of approximate nearest neighbor
search. Unlike competing methods, the proposed framework allows, up to a
scaling factor, the explicit reconstruction of the encoded vector from its
binary representation. The article also shows that the random projections
commonly used in multi-dimensional hashing methods can be advantageously
replaced by regular frames when the number of bits exceeds the original
dimension of the descriptor.

Mots-clés (translated from French): sparse coding, spread representations,
approximate nearest neighbor search, binarization
1 Introduction

This paper addresses the problem of approximate nearest neighbor (ANN)
search in high-dimensional spaces. Given a query vector, the objective is to
find, in a collection of vectors, those which are the closest to the query with
respect to a given distance function. We focus on the Euclidean distance in
this paper. This problem is of high practical interest, since matching the
descriptors representing the media is the most time-consuming operation of most
state-of-the-art audio [1], image [2] and video [3] indexing techniques. There
is a large body of literature on techniques whose aim is to optimize the
trade-off between retrieval time and complexity.

We are interested in techniques that regard the memory usage of the index as a
major criterion. This is compulsory when considering large datasets of tens of
millions to billions of vectors [H El HI E], because the indexed
representation must fit in memory to avoid costly hard-drive accesses. One
popular approach is to use a Hamming Embedding function that maps the real
vectors into binary vectors |31 El. Binary vectors are compact, and searching
the Hamming space is efficient (XOR operation and bit count) even if the
comparison between the binary query and the database vectors is exhaustive. An
extension of these techniques is the asymmetric scheme |71 [5], which limits
the approximation done on the query, leading to better results for a slightly
higher complexity.

We propose to address the ANN search problem with an anti-sparse solution
based on the design of spread representations recently proposed by Fuchs [9].
Sparse coding has received considerable attention in the last decade from both
theoretical and practical points of view. Its objective is to represent a
vector in a higher-dimensional space with a very limited number of non-zero
components. Anti-sparse coding has the opposite property: it offers a robust
representation of a vector in a higher-dimensional space, with all the
components sharing the information evenly.

Sparse and anti-sparse coding admit a common formulation. The algorithm
proposed by Fuchs [9] is indeed similar to path-following methods based
on continuation techniques like [10]. The anti-sparse problem considers an
$\ell_\infty$ penalization term where the sparse problem usually considers the
$\ell_1$ norm. The penalization in $\|\mathbf{x}\|_\infty$ limits the range of
the coefficients, which in turn tend to 'stick' their value to
$\pm\|\mathbf{x}\|_\infty$ [9]. As a result, the anti-sparse approximation
offers a natural binarization method.

Most importantly, and in contrast to other Hamming Embedding techniques,
the binarized vector allows an explicit and reliable reconstruction of the
original database vector. This reconstruction is very useful to refine the
search. First, the comparison of the Hamming distances between the binary
representations identifies some potential nearest neighbors. Second, this list
is refined by computing the Euclidean distances between the query and the
reconstructions of the database vectors.

We provide a Matlab package to reproduce the analysis and comparisons reported
in this paper (for the tests on synthetic data); see
http://www.irisa.fr/texmex/people/jegou/src.php
The paper is organized as follows. Section 2 introduces the anti-sparse coding
framework. Section 3 describes the corresponding ANN search method, which is
evaluated in Section 4 on both synthetic and real data.
2 Spread representations

This section briefly describes the anti-sparse coding of [9]. We first introduce
the objective function and then outline the algorithm that gives the
spread representation of a given real input vector.

Let $A = [\mathbf{a}_1 | \dots | \mathbf{a}_m]$ be a $d \times m$ ($d < m$)
full rank matrix. For any $\mathbf{y} \in \mathbb{R}^d$, the system
$A\mathbf{x} = \mathbf{y}$ admits an infinite number of solutions. To single
out a unique solution, one adds a constraint, for instance by seeking a
minimal-norm solution. Whereas the case of the Euclidean norm is trivial, and
the case of the $\ell_1$-norm is rooted in the vast literature on sparse
representation, Fuchs recently studied the case of the $\ell_\infty$-norm.
Formally, the problem is:

$$\mathbf{x}^\star = \arg\min_{\mathbf{x}:\, A\mathbf{x} = \mathbf{y}} \|\mathbf{x}\|_\infty, \qquad (1)$$

with $\|\mathbf{x}\|_\infty = \max_{i \in \{1,\dots,m\}} |x_i|$. Interestingly,
he proved that by minimizing the range of the components, $m - d + 1$ of them
are stuck to the limit, i.e., $x_i = \pm\|\mathbf{x}\|_\infty$. Fuchs also
exhibits an efficient way to solve (1). He proposes to solve the series of
simpler problems

$$\mathbf{x}^\star_h = \arg\min_{\mathbf{x} \in \mathbb{R}^m} J_h(\mathbf{x}) \qquad (2)$$

with

$$J_h(\mathbf{x}) = \|A\mathbf{x} - \mathbf{y}\|^2/2 + h\|\mathbf{x}\|_\infty \qquad (3)$$

for some decreasing values of $h$. As $h \to 0$, $\mathbf{x}^\star_h \to \mathbf{x}^\star$.
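As a hedged illustration of problem (1), the following worked case shows why the components stick to the limit. The setting $d = 1$ (a single-row matrix $\mathbf{a}^\top$ with non-zero entries) is an assumption chosen for simplicity and is not taken from the source.

```latex
% Worked example (illustrative assumption: d = 1, A = a^T with all a_i \neq 0, y \neq 0).
\begin{align*}
&\text{Constraint: } \mathbf{a}^\top \mathbf{x} = y, \qquad \mathbf{a} \in \mathbb{R}^m,\ y \in \mathbb{R}. \\
&\text{H\"older's inequality: } |y| = |\mathbf{a}^\top \mathbf{x}| \le \|\mathbf{a}\|_1 \|\mathbf{x}\|_\infty
  \;\Rightarrow\; \|\mathbf{x}\|_\infty \ge \frac{|y|}{\|\mathbf{a}\|_1}. \\
&\text{The bound is attained by } \mathbf{x}^\star = \frac{y}{\|\mathbf{a}\|_1}\,\mathrm{sign}(\mathbf{a}),
  \text{ since } \mathbf{a}^\top \mathbf{x}^\star = \frac{y}{\|\mathbf{a}\|_1} \sum_i |a_i| = y. \\
&\text{Hence all } m = m - d + 1 \text{ components satisfy } x^\star_i = \pm \|\mathbf{x}^\star\|_\infty .
\end{align*}
```

In this special case the minimal $\ell_\infty$-norm solution is fully binary up to scale, which is the behaviour the binarization scheme relies on.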



2.1 The sub-differential set 

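For a fixed $h$, $J_h$ is not differentiable due to $\|\cdot\|_\infty$.
Therefore, we need to work with sub-differential sets. The sub-differential
set $\partial f(\mathbf{x})$ of a function $f$ at $\mathbf{x}$ is the set of
gradients $\mathbf{v}$ s.t.
$f(\mathbf{x}') - f(\mathbf{x}) \ge \mathbf{v}^\top(\mathbf{x}' - \mathbf{x})$,
$\forall \mathbf{x}' \in \mathbb{R}^m$. For $f = \|\cdot\|_\infty$, we have:

$$\partial f(\mathbf{0}) = \{\mathbf{v} \in \mathbb{R}^m : \|\mathbf{v}\|_1 \le 1\}, \qquad (4)$$

$$\partial f(\mathbf{x}) = \{\mathbf{v} \in \mathbb{R}^m : \|\mathbf{v}\|_1 = 1,\ v_i x_i \ge 0 \text{ if } |x_i| = \|\mathbf{x}\|_\infty,\ v_i = 0 \text{ else}\}, \text{ for } \mathbf{x} \ne \mathbf{0}. \qquad (5)$$

Since $J_h$ is convex, $\mathbf{x}^\star_h$ is a solution iff $\mathbf{0}$
belongs to the sub-differential set $\partial J_h(\mathbf{x}^\star_h)$,
i.e., iff there exists $\mathbf{v} \in \partial f(\mathbf{x}^\star_h)$ s.t.

$$A^\top(A\mathbf{x}^\star_h - \mathbf{y}) + h\mathbf{v} = \mathbf{0}. \qquad (6)$$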



2.2 Initialization and first iteration 

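For $h_0$ large enough, $J_{h_0}(\mathbf{x})$ is dominated by
$\|\mathbf{x}\|_\infty$, and the solution writes
$\mathbf{x}^\star_{h_0} = \mathbf{0}$ and
$\mathbf{v} = h_0^{-1}A^\top\mathbf{y} \in \partial f(\mathbf{0})$. Equation (4)
shows that this solution no longer holds for $h < h_1$ with
$h_1 = \|A^\top\mathbf{y}\|_1$.

For $\|\mathbf{x}\|_\infty$ small enough, $J_h(\mathbf{x})$ is dominated by
$\|\mathbf{y}\|^2/2 - \mathbf{x}^\top A^\top\mathbf{y} + h\|\mathbf{x}\|_\infty$,
whose minimizer is
$\mathbf{x}^\star_h = \|\mathbf{x}\|_\infty\,\mathrm{sign}(A^\top\mathbf{y})$.
In this case, $\partial f(\mathbf{x})$ is the set of vectors $\mathbf{v}$ s.t.
$\mathrm{sign}(\mathbf{v}) = \mathrm{sign}(\mathbf{x})$ and
$\|\mathbf{v}\|_1 = 1$. Multiplying (6) by $\mathrm{sign}(\mathbf{v})^\top$ on
the left, we have

$$h = h_1 - \|A\,\mathrm{sign}(A^\top\mathbf{y})\|^2\,\|\mathbf{x}\|_\infty. \qquad (7)$$

This shows that i) $\mathbf{x}^\star_h$ can be a solution for $h < h_1$, and
ii) $\|\mathbf{x}\|_\infty$ increases as $h$ decreases. Yet, Equation (6) also
imposes that $\mathbf{v} = \boldsymbol{\nu}_1 - \boldsymbol{\mu}_1\|\mathbf{x}\|_\infty$, with

$$\boldsymbol{\nu}_1 = h^{-1}A^\top\mathbf{y} \quad\text{and}\quad \boldsymbol{\mu}_1 = h^{-1}A^\top A\,\mathrm{sign}(A^\top\mathbf{y}). \qquad (8)$$

But the condition $\mathrm{sign}(\mathbf{v}) = \mathrm{sign}(\mathbf{x})$ from
(5) must hold. This limits $\|\mathbf{x}\|_\infty$ by $\rho_{i_2}$, where
$\rho_i = \nu_i/\mu_i$ is the ratio of the $i$-th components of
$\boldsymbol{\nu}_1$ and $\boldsymbol{\mu}_1$, and
$i_2 = \arg\min_{i:\rho_i>0}\rho_i$, which in turn translates into a
lower bound $h_2$ on $h$ via (7).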
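The first iteration can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses NumPy, a random Gaussian matrix and input vector, and a hypothetical helper name `first_iteration`; none of these come from the source. It evaluates $h_1$, the bound $\rho_{i_2}$, the value of $h$ from (7) for a trial amplitude, and the vector $\mathbf{v}$ obtained by solving (6).

```python
import numpy as np

def first_iteration(A, y, t):
    """First-iteration quantities for a trial amplitude t = ||x||_inf."""
    Aty = A.T @ y
    s = np.sign(Aty)                    # sign(A^T y)
    h1 = np.abs(Aty).sum()              # h_1 = ||A^T y||_1
    As = A @ s
    rho = Aty / (A.T @ As)              # rho_i = nu_i / mu_i (the factor 1/h cancels)
    rho_i2 = rho[rho > 0].min()         # amplitude at which a component of v vanishes
    h = h1 - (As @ As) * t              # Equation (7)
    x = t * s                           # candidate minimizer
    v = (Aty - A.T @ (A @ x)) / h       # Equation (6) solved for v
    return x, v, h, h1, rho_i2

# Hypothetical toy data: d = 4, m = 8.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 8))
y = rng.standard_normal(4)
rho_i2 = first_iteration(A, y, 0.0)[4]
x, v, h, h1, _ = first_iteration(A, y, 0.5 * rho_i2)
print("h1 =", h1, "h =", h, "||v||_1 =", np.abs(v).sum())
```

For any amplitude below $\rho_{i_2}$, the computed $\mathbf{v}$ should have unit $\ell_1$ norm and the same signs as $\mathbf{x}$, which is the membership condition of (5).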



2.3 Index partition 

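For the sake of simplicity, we introduce $\mathcal{I} = \{1, \dots, m\}$, and
the index partition
$\bar{\mathcal{I}} = \{i : |x_i| = \|\mathbf{x}\|_\infty\}$ and
$\tilde{\mathcal{I}} = \mathcal{I} \setminus \bar{\mathcal{I}}$. The
restrictions of vectors and matrices to $\bar{\mathcal{I}}$ (resp.
$\tilde{\mathcal{I}}$) are denoted $\bar{\mathbf{x}}$ (resp.
$\tilde{\mathbf{x}}$). For instance, Equation (5) translates into
$\mathrm{sign}(\bar{\mathbf{v}}) = \mathrm{sign}(\bar{\mathbf{x}})$,
$\|\bar{\mathbf{v}}\|_1 = 1$ and $\tilde{\mathbf{v}} = \mathbf{0}$. The index
partition splits (6) into two parts:

$$\tilde{A}^\top(\tilde{A}\tilde{\mathbf{x}} + \bar{A}\,\mathrm{sign}(\bar{\mathbf{v}})\|\mathbf{x}\|_\infty) = \tilde{A}^\top\mathbf{y} \qquad (9)$$

$$\bar{A}^\top(\tilde{A}\tilde{\mathbf{x}} + \bar{A}\,\mathrm{sign}(\bar{\mathbf{v}})\|\mathbf{x}\|_\infty - \mathbf{y}) = -h\bar{\mathbf{v}} \qquad (10)$$

For $h_2 < h < h_1$, we have seen that $\bar{\mathbf{x}} = \mathbf{x}$,
$\bar{\mathbf{v}} = \mathbf{v}$, and $\bar{A} = A$. Their 'tilde' versions are
empty. For $h < h_2$, the index partition $\bar{\mathcal{I}} = \mathcal{I}$ and
$\tilde{\mathcal{I}} = \emptyset$ can no longer hold. Indeed, when $v_{i_2}$ is
null at $h = h_2$, the $i_2$-th column of $A$ moves from $\bar{A}$ to
$\tilde{A}$ s.t. now $\tilde{A} = [\mathbf{a}_{i_2}]$.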

2.4 General iteration 

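The general iteration consists in determining on which interval
$[h_{k+1}, h_k]$ an index partition holds, giving the expression of the
solution $\mathbf{x}^\star_h$ and proposing a new index partition for the next
iteration.
Provided $\tilde{A}$ is full rank, (9) gives

$$\tilde{\mathbf{x}} = \boldsymbol{\xi}_k + \boldsymbol{\zeta}_k\|\mathbf{x}\|_\infty, \qquad (11)$$

with

$$\boldsymbol{\xi}_k = (\tilde{A}^\top\tilde{A})^{-1}\tilde{A}^\top\mathbf{y} \qquad (12)$$

and

$$\boldsymbol{\zeta}_k = -(\tilde{A}^\top\tilde{A})^{-1}\tilde{A}^\top\bar{A}\,\mathrm{sign}(\bar{\mathbf{v}}). \qquad (13)$$

Equation (10) gives:

$$\bar{\mathbf{v}} = \boldsymbol{\nu}_k - \boldsymbol{\mu}_k\|\mathbf{x}\|_\infty, \qquad (14)$$

with

$$\boldsymbol{\mu}_k = \bar{A}^\top(I - \tilde{A}(\tilde{A}^\top\tilde{A})^{-1}\tilde{A}^\top)\bar{A}\,\mathrm{sign}(\bar{\mathbf{v}})/h \qquad (15)$$

and

$$\boldsymbol{\nu}_k = (\bar{A}^\top\mathbf{y} - \bar{A}^\top\tilde{A}\boldsymbol{\xi}_k)/h. \qquad (16)$$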
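Left-multiplying (14) by $\mathrm{sign}(\bar{\mathbf{v}})^\top$, and using
$\mathrm{sign}(\bar{\mathbf{v}})^\top\bar{\mathbf{v}} = \|\bar{\mathbf{v}}\|_1 = 1$, we get:

$$h = \eta_k - \upsilon_k\|\mathbf{x}\|_\infty \qquad (17)$$

with

$$\upsilon_k = \mathrm{sign}(\bar{\mathbf{v}})^\top\bar{A}^\top(I - \tilde{A}(\tilde{A}^\top\tilde{A})^{-1}\tilde{A}^\top)\bar{A}\,\mathrm{sign}(\bar{\mathbf{v}}) \qquad (18)$$

and

$$\eta_k = -\mathrm{sign}(\bar{\mathbf{v}})^\top\bar{A}^\top(\tilde{A}\boldsymbol{\xi}_k - \mathbf{y}). \qquad (19)$$

Note that $\upsilon_k > 0$ so that $\|\mathbf{x}\|_\infty$ increases when $h$ decreases.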

These equations extend a solution $\mathbf{x}^\star_h$ to the neighborhood of
$h$. However, we must check that this index partition is still valid as we
decrease $h$ and $\|\mathbf{x}\|_\infty$ increases. Two events can break the
validity:

• As in the first iteration, a component of $\bar{\mathbf{v}}$ given in (14)
becomes null. This index moves from $\bar{\mathcal{I}}$ to $\tilde{\mathcal{I}}$.

• A component of $\tilde{\mathbf{x}}$ given in (11) sees its amplitude reach
$\pm\|\mathbf{x}\|_\infty$. This index moves from $\tilde{\mathcal{I}}$ to
$\bar{\mathcal{I}}$, and the sign of this component will be the sign of the
new component of $\bar{\mathbf{x}}$.

The value of $\|\mathbf{x}\|_\infty$ for which one of these two events first
happens is translated into $h_{k+1}$ thanks to (17).

2.5 Stopping condition and output 

If the goal is to minimize $J_{h_t}(\mathbf{x})$ for a specific target $h_t$,
then the algorithm stops when $h_{k+1} < h_t$. The actual value of
$\|\mathbf{x}^\star_{h_t}\|_\infty$ is given by (17), and the components
not stuck to $\pm\|\mathbf{x}^\star_{h_t}\|_\infty$ by (11).

We obtain the spread representation $\mathbf{x}$ of the input vector
$\mathbf{y}$. The vector $\mathbf{x}$ has many of its components equal to
$\pm\|\mathbf{x}\|_\infty$. An approximation of the original vector
$\mathbf{y}$ is obtained by

$$\hat{\mathbf{y}} = A\mathbf{x}. \qquad (20)$$

3 Indexing and search mechanisms

This section describes how Hamming Embedding functions are used for
approximate search, and in particular how the anti-sparse coding framework
described in Section 2 is exploited.

3.1 Problem statement

Let $\mathcal{Y}$ be a dataset of $n$ real vectors,
$\mathcal{Y} = \{\mathbf{y}_1, \dots, \mathbf{y}_n\}$, where
$\mathbf{y}_i \in \mathbb{R}^d$, and consider a query vector
$\mathbf{q} \in \mathbb{R}^d$. We aim at finding the $k$ vectors in
$\mathcal{Y}$ that are closest to the query with respect to the Euclidean
distance. For ease of exposition, we consider without loss of generality the
nearest neighbor problem, i.e., the case $k = 1$. The nearest neighbor of
$\mathbf{q}$ in $\mathcal{Y}$ is defined as

$$\mathrm{NN}(\mathbf{q}) = \arg\min_{\mathbf{y} \in \mathcal{Y}} \|\mathbf{q} - \mathbf{y}\|^2. \qquad (21)$$

The goal of approximate search is to find this nearest neighbor with high
probability while using as few resources as possible. The performance criteria
are the following:

• The quality of the search, i.e., to what extent the algorithm is able to
return the true nearest neighbor;

• The search efficiency, typically measured by the query time;

• The memory usage, i.e., the number of bytes used to index a vector of
the database.

In our paper, we assess the search quality by the recall@R measure: over a set
of queries, we compute the proportion for which the system returns the true
nearest neighbor in the first R positions.
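The recall@R measure defined above can be sketched in a few lines. This is a minimal illustration; the function name `recall_at_r` and the toy ranked lists are hypothetical and not taken from the source.

```python
def recall_at_r(ranked_lists, true_nn, r):
    """Fraction of queries whose true nearest neighbor is in the first r results."""
    hits = sum(1 for ranked, nn in zip(ranked_lists, true_nn) if nn in ranked[:r])
    return hits / len(true_nn)

# Three hypothetical queries: ranked database ids returned by a search system,
# and the id of the true nearest neighbor of each query.
ranked_lists = [[4, 1, 7], [2, 9, 3], [5, 0, 8]]
true_nn = [1, 2, 6]
print(recall_at_r(ranked_lists, true_nn, 1))
print(recall_at_r(ranked_lists, true_nn, 2))
```

By construction the measure is non-decreasing in R, since a hit in the first R positions is also a hit in the first R + 1.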

3.2 Approximate search with binary embeddings 

A class of ANN methods is based on embedding [H [SI [5]. The idea is to map the
input vectors to a space where the representation is compact and the comparison
is efficient. The Hamming space offers these two desirable properties. The key
problem is the design of the embedding function
$e : \mathbb{R}^d \to \mathbb{B}^m$ mapping the input vector $\mathbf{y}$ to
$\mathbf{b} = e(\mathbf{y})$ in the $m$-dimensional Hamming space
$\mathbb{B}^m$, here defined as $\{-1, 1\}^m$ for ease of exposition.

Once this function is defined, all the database vectors are mapped to
$\mathbb{B}^m$, and the search problem is translated into the Hamming space
based on the Hamming distance, or, equivalently:

$$\mathrm{NN}_b(e(\mathbf{q})) = \arg\max_{\mathbf{y} \in \mathcal{Y}} e(\mathbf{q})^\top e(\mathbf{y}). \qquad (22)$$

$\mathrm{NN}_b(e(\mathbf{q}))$ is returned as the approximate $\mathrm{NN}(\mathbf{q})$.
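The equivalence between minimizing the Hamming distance and maximizing the inner product in (22) follows from the identity $e(\mathbf{q})^\top e(\mathbf{y}) = m - 2\,d_H$ for vectors in $\{-1, 1\}^m$. The sketch below illustrates it with an exhaustive search using XOR and a bit count; the helper names and the toy vectors are hypothetical, and packing the bits into Python integers is an assumption made for brevity.

```python
def pack_bits(b):
    """Pack a {-1, +1} vector into an int: bit k is set when b[k] == +1."""
    code = 0
    for k, value in enumerate(b):
        if value > 0:
            code |= 1 << k
    return code

def hamming(code_a, code_b):
    """Hamming distance via XOR followed by a bit count."""
    return bin(code_a ^ code_b).count("1")

def inner_product(a, b):
    return sum(u * v for u, v in zip(a, b))

def nn_b(query_bits, database_bits):
    """Index of the database code closest to the query in Hamming distance."""
    q_code = pack_bits(query_bits)
    distances = [hamming(q_code, pack_bits(b)) for b in database_bits]
    return min(range(len(distances)), key=distances.__getitem__)

# Hypothetical binary codes with m = 8.
q = [1, -1, 1, 1, -1, -1, 1, -1]
database = [
    [1, -1, -1, 1, -1, 1, 1, -1],
    [-1, 1, -1, -1, 1, 1, -1, 1],
    [1, -1, 1, 1, -1, -1, -1, -1],
]
print(nn_b(q, database))
```

Because the two criteria differ only by an affine map with negative slope, they rank the database identically.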

Binarization with anti-sparse coding. Given an input vector $\mathbf{y}$, the
anti-sparse coding of Section 2 produces $\mathbf{x}$ with many components
equal to $\pm\|\mathbf{x}\|_\infty$. We consider a "pre-binarized" version
$i(\mathbf{y}) = \mathbf{x}/\|\mathbf{x}\|_\infty$, and the binarized version
$e(\mathbf{y}) = \mathrm{sign}(\mathbf{x})$.

3.3 Hash function design 

The locality sensitive hashing (LSH) algorithm is mainly based on random
projections, though different kinds of hash functions have been proposed for
the Euclidean space [11]. Let $A = [\mathbf{a}_1 | \dots | \mathbf{a}_m]$ be a
$d \times m$ matrix storing the $m$ projection vectors. The simplest way is to
take the sign of the projections: $\mathbf{b} = \mathrm{sign}(A^\top\mathbf{y})$.
Note that this corresponds to the first iteration of our algorithm
(see Section 2.2).

We also try $A$ as a uniform frame. A possible construction of such a frame
consists in performing a QR decomposition on an $m \times m$ matrix. The matrix
$A$ is then composed of the first $d$ rows of the $Q$ matrix, ensuring that
$A A^\top = I_d$. Section 4 shows that such frames significantly improve the
results compared with random projections, for both the LSH and anti-sparse
coding embedding methods.
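The two choices of $A$ can be sketched as follows. This is an illustration under assumptions: NumPy is used, the $m \times m$ matrix fed to the QR decomposition is drawn with Gaussian entries (the source does not say how it is chosen), and the function names are hypothetical.

```python
import numpy as np

def random_projections(d, m, rng):
    """LSH-style matrix: m Gaussian projection directions, one per column."""
    return rng.standard_normal((d, m))

def qr_frame(d, m, rng):
    """First d rows of the orthogonal factor Q of a random m x m matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((m, m)))
    return q[:d, :]

def sign_embedding(A, y):
    """b = sign(A^T y), mapped to {-1, +1} (zeros are sent to +1)."""
    return np.where(A.T @ y >= 0, 1, -1)

rng = np.random.default_rng(0)
d, m = 16, 48
A_frame = qr_frame(d, m, rng)
A_rand = random_projections(d, m, rng)
y = rng.standard_normal(d)
b = sign_embedding(A_frame, y)
print(np.allclose(A_frame @ A_frame.T, np.eye(d)))
```

Since $Q$ is orthogonal, its rows are orthonormal, so any $d$ of them satisfy $A A^\top = I_d$; the random projection matrix has no such guarantee.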

3.4 Asymmetric schemes 

As recently suggested in the literature, a better search quality is obtained by 
avoiding the binarization of the query vector. Several variants are possible. 



RR n° 7771 



Anti-sparse coding for approximate search 






We consider the simplest one derived from (22), where the query is not
binarized in the inner product. For our anti-sparse coding scheme, this amounts
to performing the search based on the following maximization:

$$\mathrm{NN}_a(e(\mathbf{q})) = \arg\max_{\mathbf{y} \in \mathcal{Y}} i(\mathbf{q})^\top e(\mathbf{y}). \qquad (23)$$

The estimate $\mathrm{NN}_a$ is better than $\mathrm{NN}_b$. The memory usage
is the same because the vectors in the database $\{e(\mathbf{y}_i)\}$ are all
binarized. However, this asymmetric scheme is a bit slower than the pure
bit-based comparison. For better efficiency, the search (23) is done using
look-up tables computed for the query prior to the comparisons [8]. This
remains slightly slower than computing the Hamming distances in (22). This
asymmetric scheme is of interest for any binarization scheme (LSH or
anti-sparse coding) and any definition of $A$ (either random projections or a
frame).
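One way to organize such look-up tables is sketched below. The layout is an assumption, not the source's method: the code is split into chunks of 8 bits, and for each chunk a table stores the partial inner product between the real-valued query and every possible bit pattern. All function names and the toy data are hypothetical.

```python
def build_lookup_tables(query, bits_per_chunk=8):
    """tables[j][c] = partial inner product between chunk j of the real-valued
    query and the +/-1 pattern encoded by the integer c (bit k set means +1)."""
    tables = []
    for start in range(0, len(query), bits_per_chunk):
        chunk = query[start:start + bits_per_chunk]
        table = []
        for c in range(1 << len(chunk)):
            table.append(sum(w if (c >> k) & 1 else -w for k, w in enumerate(chunk)))
        tables.append(table)
    return tables

def pack_chunks(b, bits_per_chunk=8):
    """Split a {-1, +1} vector into integer codes, one per chunk."""
    codes = []
    for start in range(0, len(b), bits_per_chunk):
        code = 0
        for k, value in enumerate(b[start:start + bits_per_chunk]):
            if value > 0:
                code |= 1 << k
        codes.append(code)
    return codes

def asymmetric_score(tables, codes):
    """Inner product between the real-valued query and a binarized vector."""
    return sum(table[c] for table, c in zip(tables, codes))

# Hypothetical pre-binarized query (m = 10) and one binarized database vector.
query = [0.5, -1.0, 0.25, 1.0, -0.75, 0.1, -0.2, 0.9, 0.3, -0.6]
b = [1, -1, -1, 1, 1, -1, 1, 1, -1, 1]
tables = build_lookup_tables(query)
print(asymmetric_score(tables, pack_chunks(b)))
```

The tables are built once per query; each database comparison then costs one table read and one addition per chunk, which is why the scheme stays close to the cost of a Hamming comparison.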

3.5 Explicit reconstruction 

The anti-sparse binarization scheme explicitly minimizes the reconstruction
error, which is traded off in (3) against the regularization term. Equation (20)
gives an explicit approximation of the database vector $\mathbf{y}$ up to a
scaling factor: $\hat{\mathbf{y}} \propto A\mathbf{b}$. The approximate nearest
neighbors $\mathrm{NN}_e$ are obtained by computing the exact Euclidean
distances $\|\mathbf{q} - \hat{\mathbf{y}}_i\|_2$. This is slow compared to the
Hamming distance computation. That is why it is used, as in [5], to re-rank
the first hypotheses returned based on the Hamming distance (or on the
asymmetric scheme described in Section 3.4). The main difference with [5]
is that no extra code has to be retrieved: the reconstruction
$\hat{\mathbf{y}}$ solely relies on $\mathbf{b}$.
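The re-ranking stage can be sketched as follows. Several choices here are assumptions for illustration only: NumPy is used, the database vectors are taken to be unit-norm so the reconstruction is rescaled to unit norm, the binary codes are plain sign codes standing in for anti-sparse codes, and the function names are hypothetical.

```python
import numpy as np

def reconstruct(A, b):
    """Reconstruction from a binary code, up to scale (rescaled to unit norm,
    an assumption suited to normalized database vectors)."""
    y_hat = A @ b
    return y_hat / np.linalg.norm(y_hat)

def rerank(q, A, codes, shortlist):
    """Re-order shortlisted ids by Euclidean distance between the query and
    the reconstructions of the corresponding binary codes."""
    dists = [np.linalg.norm(q - reconstruct(A, codes[i])) for i in shortlist]
    order = np.argsort(dists)
    return [shortlist[j] for j in order]

# Hypothetical toy setup: frame from a QR decomposition, unit-norm data.
rng = np.random.default_rng(0)
d, m, n = 4, 8, 20
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
A = Q[:d, :]
Y = rng.standard_normal((n, d))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)
codes = np.where(Y @ A >= 0, 1, -1)      # stand-in sign codes, one row per vector
q = Y[3]
shortlist = [0, 3, 5, 7]                 # e.g. ids returned by a Hamming-based first stage
print(rerank(q, A, codes, shortlist))
```

Only the binary codes and the matrix $A$ are needed at this stage, which matches the point made above that no extra code has to be stored.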

4 Simulations and experiments 

This section evaluates the search quality on synthetic and real data. In
particular, we measure the impact of:

• The Hamming embedding technique: LSH and binarization based on anti-sparse
coding. We also compare to the spectral hashing method of [5], using the code
available online.

• The choice of matrix $A$: random projections or frame for LSH. For the
anti-sparse coding, we always assume a frame.

• The search method: 1) $\mathrm{NN}_b$ of (22), 2) $\mathrm{NN}_a$ of (23)
and 3) $\mathrm{NN}_e$ as described in Section 3.5.

Figure 1: Anti-sparse coding vs LSH on synthetic data. Search quality
(recall@10 in a vector set of 10,000 vectors) as a function of the number of
bits of the representation (horizontal axis: $m$, number of bits, from 16 to
128).

Our comparison focuses on the case $m \ge d$. In the anti-sparse coding
method, the regularization term $h$ controls the trade-off between the
robustness of the Hamming embedding and the quality of the reconstruction.
Small values of $h$ favor the quality of the reconstruction (without any
binarization). Larger values of $h$ give more components stuck to
$\pm\|\mathbf{x}\|_\infty$, which improves the approximate search with binary
embedding. Ideally, this parameter should be adjusted to give a reasonable
trade-off between the efficiency of the first stage (methods $\mathrm{NN}_b$ or
$\mathrm{NN}_a$) and the re-ranking stage ($\mathrm{NN}_e$). Note however that,
thanks to the algorithm described in Section 2, the parameter is stable, i.e.,
a slight modification of this parameter only affects a few components. We set
$h = 1$ in all our experiments. Two datasets are considered for the evaluation:

• A database of 10,000 16-dimensional vectors uniformly drawn on the
Euclidean unit sphere (normalized Gaussian vectors) and a set of 1,000 query
vectors.

• A database of SIFT [H] descriptors available online (see footnote),
comprising 1 million database vectors and 10,000 query vectors of
dimensionality 128. Similar to [5], we first reduce the vector dimensionality
to 48 components using principal component analysis (PCA). The vectors are not
normalized after PCA.

The comparison of LSH and anti-sparse. Figures 1 and 2 show the performance
of Hamming embeddings for synthetic data. In Fig. 1, the quality measure is
the recall@10 (proportion of true NN ranked in the first 10 positions)
plotted as a function of the number of bits $m$. For LSH, observe the much
better performance obtained by the proposed frame construction compared with
random projections. The same conclusion holds for anti-sparse binarization.

The anti-sparse coding offers a search quality similar to that of LSH for
$m = d$ when the comparison is performed using $\mathrm{NN}_b$ of (22). The
improvement gets significant as $m$ increases. The spectral hashing technique
[3] exhibits poor performance on this synthetic dataset.

The asymmetric comparison $\mathrm{NN}_a$ leads to a significant improvement,
as already observed in [T) |H]. The interest of anti-sparse coding becomes
obvious by considering the performance of the comparison $\mathrm{NN}_e$ based
on the explicit reconstruction of the database vectors from their binary-coded
representations. For a fixed number of bits, the improvement is huge compared
to LSH. It is worth using this technique to re-rank the first hypotheses
obtained by $\mathrm{NN}_b$ or $\mathrm{NN}_a$.

Figure 2: Anti-sparse coding vs LSH on synthetic data ($m = 48$, 10,000 vectors
in dataset).

Footnote: http://corpus-texmex.irisa.fr

Experiments on SIFT descriptors. As shown by Figure 3, LSH is slightly
better than anti-sparse coding on real data when using the binary
representation only (here $m = 128$), which might be solved by tuning $h$,
since the first iteration of anti-sparse coding leads to the same binarization
as LSH. However, the interest of the explicit reconstruction offered by
$\mathrm{NN}_e$ is again obvious. The final search quality is significantly
better than that obtained by spectral hashing [5]. Since we do not
specifically handle the fact that our descriptors are not normalized after PCA,
our results could probably be improved by taking care of the $\ell_2$ norm.



5 Conclusion and open issues 

In this paper, we have proposed anti-sparse coding as an effective Hamming
embedding, which, unlike competing techniques, offers an explicit
reconstruction of the database vectors. To our knowledge, it outperforms all
other search techniques based on binarization. Two open issues remain to get
the best out of the method. First, the computational cost is still a bit high
for high-dimensional vectors. Second, although the proposed codebook
construction is better than random projections, it is not yet specifically
adapted to real data.
Figure 3: Approximate search in a SIFT vector set of 1 million vectors.